{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Sampling and Hypothesis Testing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Random Variables | Examples in Python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select this cell and type Ctrl-Enter to execute the code below.\n",
    "\n",
    "import numpy as np\n",
    "from scipy import stats\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bernoulli Distribution\n",
    "\n",
    "The simplest discrete probability distribution is the **Bernoulli distribution**: \n",
    "\n",
    "$$B \\sim \\text{Bernoulli}(p)$$\n",
    "\n",
    "This describes a situation where there are only two possible outcomes, labelled \"success\" ($B=1$) and \"failure\" ($B=0$).\n",
    "\n",
    "The probability of obtaining a success is a constant, $p$.\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    " \\mathbb{P}(B = x) &= \\begin{cases}\n",
    "  p & \\text{for $x=1$}\\\\\n",
    "  1-p & \\text{for $x=0$}\n",
    "  \\end{cases}\n",
    "\\\\\n",
    "\\\\\n",
    "\\mathbb{E}B &= 1 \\cdot p + 0 \\cdot (1-p) = p\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Var}B &= \\mathbb{E}(B-p)^2 = (1-p)^2 \\cdot p + (0-p)^2 \\cdot (1-p) = p(1-p)\n",
    "\\end{align*}\n",
    "$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example: rolling a six with one die\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "six = stats.bernoulli(1/6)  # a Bernoulli distribution with p=1/6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the probability mass function\n",
    "x = np.arange(2)\n",
    "plt.plot(x,six.pmf(x), 'ro', ms=8)\n",
    "plt.vlines(x, 0, six.pmf(x), colors='r', lw=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the expected value\n",
    "six.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the variance\n",
    "six.var()"
   ]
  },
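  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough check on the formulas above, the sketch below draws random samples from `six` and compares the sample mean and variance with $p$ and $p(1-p)$. It assumes the earlier cells have been run; the name `rolls`, the sample size and the seed passed to `random_state` are arbitrary choices made for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# draw 10000 simulated rolls: 1 means a six was rolled, 0 means it was not\n",
    "rolls = six.rvs(size=10000, random_state=0)\n",
    "\n",
    "# the sample statistics should be close to, but not exactly equal to, the theoretical values\n",
    "print('mean:    ', rolls.mean(), 'vs', six.mean())\n",
    "print('variance:', rolls.var(), 'vs', six.var())"
   ]
  },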
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Binomial Distribution\n",
    "\n",
    "If $X$ is the number of successes in $n$ *independent and identically distributed* (i.i.d.) Bernoulli trials, with probability of success $p$, then $X$ is said to follow a **binomial distribution**: \n",
    "\n",
    "$$X = B_{1} + ... + B_{n} \\sim \\text{binom}(n,p)$$\n",
    "\n",
    "The probability of obtaining $x$ successes is given by\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "  \\mathbb{P}(X = x) &= \\binom{n}{x}p^{x}(1-p)^{n-x}.\n",
    "\\\\\n",
    "\\\\\n",
    "\\mathbb{E}X &= \\mathbb{E}( B_{1} + \\cdots + B_{n} ) = \\mathbb{E}B_{1} + \\cdots + \\mathbb{E}B_{n} = np\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Var}X &= \\text{Var}( B_{1} + \\cdots + B_{n} ) = \\text{Var}B_{1} + \\cdots + \\text{Var}B_{n} = np(1-p)\n",
    "\\end{align*}\n",
    "$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example: number of sixes obtained when rolling ten dice\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sixes = stats.binom(10, 1/6)  # n=10, p=1/6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the probability mass function\n",
    "x = np.arange(11)\n",
    "plt.plot(x,sixes.pmf(x), 'ro', ms=8)\n",
    "plt.vlines(x, 0, sixes.pmf(x), colors='r', lw=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the cumulative distribution function\n",
    "plt.step(x,sixes.cdf(x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the expected value\n",
    "sixes.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the variance\n",
    "sixes.var()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculate the probability of rolling one or more sixes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "1 - sixes.pmf(0)"
   ]
  },
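  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The calculation above uses the complement rule: \"one or more sixes\" is the complement of \"no sixes\". As a worked check with $n=10$ and $p=1/6$:\n",
    "\n",
    "$$\n",
    "\\mathbb{P}(X \\ge 1) = 1 - \\mathbb{P}(X = 0) = 1 - \\binom{10}{0}\\left(\\tfrac{1}{6}\\right)^{0}\\left(\\tfrac{5}{6}\\right)^{10} = 1 - \\left(\\tfrac{5}{6}\\right)^{10} \\approx 0.838\n",
    "$$"
   ]
  },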
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Poisson Distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **Poisson distribution** describes the number of observations of an event that is randomly distributed in space or time.\n",
    "\n",
    "$$X \\sim \\text{Poisson}(\\lambda)$$\n",
    "\n",
    "e.g., number of radioactive decays in a second, number of accidents in a year, number of mutations on a chromosome.\n",
    "\n",
    "The probability of observing $x$ events is given by\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "  \\mathbb{P}(X = x) &= \\frac{e^{-\\lambda}\\lambda^{x}}{x!} \\text{ for } x=0,1,2,...\n",
    "\\\\\n",
    "\\\\\n",
    "\\mathbb{E}X &= \\lambda\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Var}X &= \\lambda\n",
    "\\end{align*}\n",
    "$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "impacts = stats.poisson(4) # e.g. an average of 4 meteorite impacts per year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the probability mass function\n",
    "x = np.arange(16)\n",
    "plt.plot(x,impacts.pmf(x), 'ro', ms=8)\n",
    "plt.vlines(x, 0, impacts.pmf(x), colors='r', lw=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the cumulative distribution function\n",
    "plt.step(x,impacts.cdf(x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the expected value\n",
    "impacts.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the variance\n",
    "impacts.var()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What is the probability of observing between 2 and 4 meteorite impacts in a given year?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "impacts.cdf(4) - impacts.cdf(1)"
   ]
  },
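  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `cdf(1)` is subtracted rather than `cdf(2)`: `cdf(k)` gives $\\mathbb{P}(X \\le k)$, and the value $x=2$ has to stay in the range. As a cross-check, the sketch below sums the pmf over $x = 2, 3, 4$ directly; the two results should agree up to floating-point rounding. It assumes the earlier cells have been run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# add up P(X = x) for x = 2, 3, 4\n",
    "impacts.pmf(np.arange(2, 5)).sum()"
   ]
  },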
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Uniform Distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **uniform distribution** describes a continuous random variable with a flat pdf over a specified interval.\n",
    "\n",
    "$$X \\sim U(a,b)$$\n",
    "\n",
    "e.g. angle of a spinner, where $a=0$ and $b=360$.\n",
    "\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "  f(x) &= \\frac{1}{b-a} \\text{ for } a \\le x \\le b\n",
    "\\\\\n",
    "\\\\\n",
    "\\mathbb{E}X &= \\frac{1}{2}(a+b)\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Var}X &= \\frac{1}{12}(b-a)^2\n",
    "\\end{align*}\n",
    "$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "angle = stats.uniform(0,360) # e.g. angle of a spinner."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the probability density function\n",
    "x = np.linspace(-30,390,100)\n",
    "plt.plot(x, angle.pdf(x), color='r')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the cumulative distribution function\n",
    "plt.plot(x,angle.cdf(x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the mean\n",
    "angle.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the variance\n",
    "angle.var()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What is the probability of spinning an angle between 90 and 180 degrees?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "angle.cdf(180) - angle.cdf(90)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exponential Distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **exponential distribution** describes waiting times between Poisson events.\n",
    "\n",
    "$$X \\sim \\text{exp}(\\lambda)$$\n",
    "\n",
    "e.g. time until a single U-238 atom decays.\n",
    "\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "  f(x) &= \\lambda e^{-\\lambda x}\n",
    "\\\\\n",
    "\\\\\n",
    "\\mathbb{E}X &= \\frac{1}{\\lambda}\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Var}X &= \\frac{1}{\\lambda^2}\n",
    "\\end{align*}\n",
    "$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lam = 4  # e.g. an average of 4 meteorite impacts per year.\n",
    "wait = stats.expon(0,1/lam)  # X describes the time until the first meteorite impact, in years."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the probability density function\n",
    "x = np.linspace(0,2,100)\n",
    "plt.plot(x, wait.pdf(x), color='r')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the cumulative distribution function\n",
    "plt.plot(x,wait.cdf(x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the mean\n",
    "wait.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the variance\n",
    "wait.var()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find the probability of observing a meteorite impact during the first half of the year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wait.cdf(0.5)"
   ]
  },
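  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The link between the exponential and Poisson distributions can be checked numerically: the first impact falls within half a year exactly when at least one impact is counted in that half year. The sketch below assumes the earlier cells have been run and that the expected count scales with the length of the interval, giving $4 \\times 0.5 = 2$ expected impacts; `half_year` is a name introduced here for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# number of impacts in half a year, modelled as Poisson with mean lam * 0.5\n",
    "half_year = stats.poisson(lam * 0.5)\n",
    "\n",
    "# P(wait <= 0.5) and P(at least one impact in half a year) should both equal 1 - exp(-2)\n",
    "print(wait.cdf(0.5), 1 - half_year.pmf(0))"
   ]
  },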
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Normal Distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **normal distribution** (also known as the Gaussian distribution) describes many situations associated with measurement. Its parameters are the *mean*, $\\mu$, and the *variance*, $\\sigma^2$:\n",
    "\n",
    "$$X \\sim N(\\mu,\\sigma^2)$$\n",
    "\n",
    "e.g. measured thickness of a piece of paper\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "  f(x) &= \\frac{1}{\\sqrt{2\\pi\\sigma^2}}e^{\\frac{(x-\\mu)^2}{2\\sigma^2}}\n",
    "\\\\\n",
    "\\\\\n",
    "\\mathbb{E}X &= \\mu\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Var}X &= \\sigma^2\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "The normal distribution can be used as an approximation to the binomial ( for large $n$ ) and the Poisson ( for large $\\lambda$ ).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mu = 200\n",
    "sigma = 20\n",
    "thickness = stats.norm(mu,sigma)  # paper thickness in microns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the probability density function\n",
    "x = np.linspace(100,300,100)\n",
    "plt.plot(x, thickness.pdf(x), color='r')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the cumulative distribution function\n",
    "plt.plot(x,thickness.cdf(x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the mean\n",
    "thickness.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the variance\n",
    "thickness.var()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What proportion of measurements are expected to be over 225 $\\mu m$?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "1 - thickness.cdf(225)"
   ]
  },
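  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The normal approximation to the binomial mentioned above can be illustrated with a short sketch. The values $n=100$ and $p=0.5$, the threshold 55 and the names `exact` and `approx` are assumptions chosen for this example. The normal distribution is given the same mean, $np$, and standard deviation, $\\sqrt{np(1-p)}$, as the binomial, and 0.5 is added to the threshold (a continuity correction, introduced here) because a continuous distribution is standing in for a discrete one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n, p = 100, 0.5  # illustrative values with n large\n",
    "exact = stats.binom(n, p)\n",
    "approx = stats.norm(n * p, np.sqrt(n * p * (1 - p)))  # same mean and standard deviation\n",
    "\n",
    "# P(X <= 55): exact binomial value vs. normal approximation with continuity correction\n",
    "print(exact.cdf(55), approx.cdf(55.5))"
   ]
  },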
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Log-normal distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many processes in biology, chemistry and the social sciences lead to variables that have **log-normal distributions**, that is, $\\log{X}$ follows a normal distribution."
   ]
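  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of this relationship in `scipy.stats`, assuming the earlier import cell has been run: if $\\log X \\sim N(\\mu, \\sigma^2)$, the matching `stats.lognorm` object takes $\\sigma$ as its shape parameter and $e^{\\mu}$ as its `scale`. The names `log_mu`, `log_sigma`, `growth` and `samples`, the parameter values and the seed are illustrative assumptions, not taken from any data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "log_mu, log_sigma = 0.0, 0.5  # illustrative mean and standard deviation of log(X)\n",
    "growth = stats.lognorm(log_sigma, scale=np.exp(log_mu))\n",
    "\n",
    "# the logarithm of the samples should have mean and standard deviation close to log_mu and log_sigma\n",
    "samples = growth.rvs(size=10000, random_state=0)\n",
    "print(np.log(samples).mean(), np.log(samples).std())"
   ]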
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
